cloud system
From Observability Data to Diagnosis: An Evolving Multi-agent System for Incident Management in Cloud Systems
Luo, Yu, Jiang, Jiamin, Feng, Jingfei, Tao, Lei, Zhang, Qingliang, Wen, Xidao, Sun, Yongqian, Zhang, Shenglin, Pei, Dan
Abstract: Incident management (IM) is central to the reliability of large-scale cloud systems. Yet manual IM, where on-call engineers examine metrics, logs, and traces, is labor-intensive and error-prone in the face of massive and heterogeneous observability data. Existing automated IM approaches often struggle to generalize across systems, provide limited interpretability, and incur high deployment costs, which hinders adoption in practice. In this paper, we present OpsAgent, a lightweight, self-evolving multi-agent system for IM that employs a training-free data processor to convert heterogeneous observability data into structured textual descriptions, along with a multi-agent collaboration framework that makes diagnostic inference transparent and auditable. To support continual capability growth, OpsAgent also introduces a dual self-evolution mechanism that integrates internal model updates with external experience accumulation, thereby closing the deployment loop. Comprehensive experiments on the OPENRCA [1] benchmark demonstrate state-of-the-art performance and show that OpsAgent is generalizable, interpretable, cost-efficient, and self-evolving, making it a practically deployable and sustainable solution for long-term operation in real-world cloud systems. Cloud systems have become the de facto platform for modern software services, with wide deployments across industries such as IT, government, and finance [2]. However, incidents (e.g., service disruptions and outages) [2], [3] are inevitable due to the complexity of cloud systems, often resulting in catastrophic economic and operational consequences. For example, Google Cloud triggered a global outage that lasted nearly eight hours, disrupting more than 80 GCP services and cascading into failures across e-commerce, finance, AI applications, entertainment platforms, and transportation systems worldwide.
The economic impact of this incident was substantial, as it encompassed not only Google's direct losses but also widespread hidden costs borne by countless enterprises and end users affected by the disruption [4]. Traditionally, on-call engineers (OCEs) manually inspect metrics, logs, and traces to identify the root cause when incidents occur [2].
- Health & Medicine > Diagnostic Medicine (0.67)
- Information Technology > Services (0.54)
Anomaly Detection in Large-Scale Cloud Systems: An Industry Case and Dataset
Islam, Mohammad Saiful, Rakha, Mohamed Sami, Pourmajidi, William, Sivaloganathan, Janakan, Steinbacher, John, Miranskyy, Andriy
As Large-Scale Cloud Systems (LCS) become increasingly complex, effective anomaly detection is critical for ensuring system reliability and performance. However, there is a shortage of large-scale, real-world datasets available for benchmarking anomaly detection methods. To address this gap, we introduce a new high-dimensional dataset from IBM Cloud, collected over 4.5 months from the IBM Cloud Console. This dataset comprises 39,365 rows and 117,448 columns of telemetry data. Additionally, we demonstrate the application of machine learning models for anomaly detection and discuss the key challenges faced in this process. This study and the accompanying dataset provide a resource for researchers and practitioners in cloud system monitoring. It facilitates more efficient testing of anomaly detection methods in real-world data, helping to advance the development of robust solutions to maintain the health and performance of large-scale cloud infrastructures.
- North America > Canada > Ontario > Toronto (0.04)
- North America > United States > New York (0.04)
- North America > United States > Georgia > Fulton County > Atlanta (0.04)
- North America > United States > California > Orange County > Irvine (0.04)
Knowledge-aware Alert Aggregation in Large-scale Cloud Systems: a Hybrid Approach
Kuang, Jinxi, Liu, Jinyang, Huang, Junjie, Zhong, Renyi, Gu, Jiazhen, Yu, Lan, Tan, Rui, Yang, Zengyin, Lyu, Michael R.
Due to the scale and complexity of cloud systems, a system failure can trigger an "alert storm", i.e., a large number of correlated alerts. Although these alerts can be traced back to a few root causes, their overwhelming number makes manual handling infeasible. Alert aggregation is thus critical to help engineers concentrate on the root cause and facilitate failure resolution. Existing methods typically rely on semantic similarity or on statistics to aggregate alerts. However, semantic similarity-based methods overlook the causal rationale of alerts, while statistical methods can hardly handle infrequent alerts. To tackle these limitations, we propose leveraging external knowledge, namely the Standard Operation Procedures (SOPs) of alerts, as a supplement. We propose COLA, a novel hybrid approach based on correlation mining and LLM (Large Language Model) reasoning for online alert aggregation. The correlation mining module captures the temporal and spatial relations between alerts and measures their correlations efficiently. Subsequently, only uncertain pairs with low confidence are forwarded to the LLM reasoning module for detailed analysis. This hybrid design harnesses both statistical evidence for frequent alerts and the reasoning capabilities of computationally intensive LLMs, ensuring the overall efficiency of COLA in handling large volumes of alerts in practical scenarios. We evaluate COLA on three datasets collected from the production environment of a large-scale cloud platform. The experimental results show COLA achieves F1-scores from 0.901 to 0.930, outperforming state-of-the-art methods and achieving comparable efficiency. We also share our experience in deploying COLA in our real-world cloud system, Cloud X.
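The confidence-gated routing described in the COLA abstract can be illustrated with a minimal sketch. This is not COLA's implementation: the scoring function (Jaccard overlap of coarse time windows), the thresholds `low` and `high`, the function names, and the sample alerts are all assumptions chosen for illustration, and only temporal co-occurrence is modeled. Pairs whose score falls between the two thresholds are the only ones that would be handed to an LLM reasoning step.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence_scores(alerts, window=60):
    """Score each alert-type pair by Jaccard overlap of the time windows it fires in."""
    windows = defaultdict(set)
    for timestamp, alert_type in alerts:
        windows[alert_type].add(timestamp // window)
    scores = {}
    for a, b in combinations(sorted(windows), 2):
        union = windows[a] | windows[b]
        scores[(a, b)] = len(windows[a] & windows[b]) / len(union)
    return scores


def route_pairs(scores, low=0.2, high=0.8):
    """Split pairs into confidently correlated, confidently unrelated, and uncertain."""
    correlated, unrelated, uncertain = [], [], []
    for pair, score in scores.items():
        if score >= high:
            correlated.append(pair)
        elif score <= low:
            unrelated.append(pair)
        else:
            # Hypothetical hand-off point: only these pairs would reach the LLM.
            uncertain.append(pair)
    return correlated, unrelated, uncertain


# Made-up (timestamp_seconds, alert_type) events, for illustration only.
alerts = [
    (0, "disk_full"), (5, "io_latency"),
    (65, "disk_full"), (70, "io_latency"),
    (130, "disk_full"), (200, "io_latency"),
    (300, "cpu_high"),
    (400, "link_down"), (401, "packet_loss"),
    (500, "link_down"), (505, "packet_loss"),
]
correlated, unrelated, uncertain = route_pairs(cooccurrence_scores(alerts))
print("correlated:", correlated)
print("uncertain:", uncertain)
```

With this sample data, `link_down` and `packet_loss` always fire in the same windows and are aggregated on statistical evidence alone, while `disk_full` and `io_latency` overlap only partially and land in the uncertain bucket. Keeping the expensive reasoning step behind such a gate is what lets the cheap statistical pass absorb most of the alert volume.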
- Europe > Portugal > Lisbon > Lisbon (0.05)
- Asia > China > Hong Kong (0.04)
- North America > United States (0.04)
- Research Report > Promising Solution (0.34)
- Research Report > New Finding (0.34)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)
Why does Prediction Accuracy Decrease over Time? Uncertain Positive Learning for Cloud Failure Prediction
Li, Haozhe, Ma, Minghua, Liu, Yudong, Zhao, Pu, Zheng, Lingling, Li, Ze, Dang, Yingnong, Chintalapati, Murali, Rajmohan, Saravan, Lin, Qingwei, Zhang, Dongmei
With the rapid growth of cloud computing, a variety of software services have been deployed in the cloud. To ensure the reliability of cloud services, prior studies focus on predicting failures of instances (disk, node, switch, etc.). Once the output of prediction is positive, mitigation actions are taken to rapidly resolve the underlying failure. According to our real-world practice in Microsoft Azure, we find that the prediction accuracy may decrease by about 9% after retraining the models. The mitigation actions may result in uncertain positive instances, since these instances cannot be verified after mitigation, and this may introduce more noise while updating the prediction model. To the best of our knowledge, we are the first to identify this Uncertain Positive Learning (UPLearning) issue in the real-world cloud failure prediction scenario. To tackle this problem, we design an Uncertain Positive Learning Risk Estimator (Uptake) approach. Using two real-world datasets of disk failure prediction and conducting node prediction experiments in Microsoft Azure, which is a top-tier cloud provider that serves millions of users, we demonstrate Uptake can significantly improve the failure prediction accuracy by 5% on average.
- North America > United States > District of Columbia > Washington (0.05)
- North America > United States > New York > New York County > New York City (0.04)
- Oceania > Australia > New South Wales > Sydney (0.04)
- Information Technology > Data Science > Data Mining (1.00)
- Information Technology > Cloud Computing (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)
Log-based Anomaly Detection based on EVT Theory with feedback
Liu, Jinyang, Huang, Junjie, Huo, Yintong, Jiang, Zhihan, Gu, Jiazhen, Chen, Zhuangbin, Feng, Cong, Yan, Minzhi, Lyu, Michael R.
System logs play a critical role in maintaining the reliability of software systems. Numerous studies have explored automatic log-based anomaly detection and achieved notable accuracy on benchmark datasets. However, when applied to large-scale cloud systems, these solutions face limitations due to high resource consumption and lack of adaptability to evolving logs. In this paper, we present an accurate, lightweight, and adaptive log-based anomaly detection framework, referred to as SeaLog. Our method introduces a Trie-based Detection Agent (TDA) that employs a lightweight, dynamically growing trie structure for real-time anomaly detection. To enhance TDA's accuracy in response to evolving log data, we enable it to receive feedback from experts. Interestingly, our findings suggest that contemporary large language models, such as ChatGPT, can provide feedback with a level of consistency comparable to human experts, which can potentially reduce manual verification efforts. We extensively evaluate SeaLog on two public datasets and an industrial dataset. The results show that SeaLog outperforms all baseline methods in terms of effectiveness, runs 2X to 10X faster, and consumes only 5% to 41% of the memory.
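To make the idea of a dynamically growing trie with feedback more concrete, here is a minimal sketch. It is not SeaLog's TDA: the class names, the whitespace tokenizer, the rule that masks any token containing a digit as `<*>`, and the "unseen path means anomalous" criterion are all assumptions made for illustration. The sketch shows only the general pattern of learning token paths, flagging lines that fall outside them, and growing the trie when feedback marks a flagged line as normal.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.is_end = False


class TrieDetector:
    """Flags a log line as anomalous when its token path is not yet in the trie."""

    def __init__(self):
        self.root = TrieNode()

    @staticmethod
    def tokenize(line):
        # Assumed masking rule: tokens containing digits become a wildcard,
        # so variable parts (IPs, device ids) share one path.
        return ["<*>" if any(c.isdigit() for c in tok) else tok
                for tok in line.split()]

    def learn(self, line):
        node = self.root
        for tok in self.tokenize(line):
            node = node.children.setdefault(tok, TrieNode())
        node.is_end = True

    def is_anomalous(self, line):
        node = self.root
        for tok in self.tokenize(line):
            node = node.children.get(tok)
            if node is None:
                return True
        return not node.is_end

    def feedback(self, line, is_normal):
        # Feedback from an expert (or an LLM standing in for one): a line
        # confirmed as normal is added so it is not flagged again.
        if is_normal:
            self.learn(line)


detector = TrieDetector()
detector.learn("Connection from 10.0.0.1 closed")
seen = detector.is_anomalous("Connection from 10.0.0.7 closed")
before = detector.is_anomalous("Disk error on sda1")
detector.feedback("Disk error on sda1", is_normal=True)
after = detector.is_anomalous("Disk error on sdb2")
print(seen, before, after)
```

In this sketch each lookup costs time proportional to the number of tokens in the line, and memory grows only when a genuinely new token path is learned, which is one plausible reason a trie suits lightweight, real-time detection.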
- North America > United States > District of Columbia > Washington (0.05)
- Asia > China > Hong Kong (0.04)
- North America > United States > California > Alameda County > Berkeley (0.04)
- Asia > Middle East > Jordan (0.04)
Scaling Data Science Solutions with Semantics and Machine Learning: Bosch Case
Zhou, Baifan, Nikolov, Nikolay, Zheng, Zhuoxun, Luo, Xianghui, Savkovic, Ognjen, Roman, Dumitru, Soylu, Ahmet, Kharlamov, Evgeny
Industry 4.0 and Internet of Things (IoT) technologies unlock an unprecedented amount of data from factory production, posing big data challenges in volume and variety. In that context, distributed computing solutions such as cloud systems are leveraged to parallelise the data processing and reduce computation time. As cloud systems become increasingly popular, there is growing demand for users who were not originally cloud experts (such as data scientists and domain experts) to deploy their solutions on cloud systems. However, it is non-trivial to address both the high demand for cloud system users and the excessive time required to train them. To this end, we propose SemCloud, a semantics-enhanced cloud system that couples a cloud system with semantic technologies and machine learning. SemCloud relies on domain ontologies and mappings for data integration, and parallelises the semantic data integration and data analysis on distributed computing nodes. Furthermore, SemCloud adopts adaptive Datalog rules and machine learning for automated resource configuration, allowing non-cloud experts to use the cloud system. The system has been evaluated in an industrial use case with millions of data points, thousands of repeated runs, and domain users, showing promising results.
- Europe > Norway > Eastern Norway > Oslo (0.04)
- Europe > Italy (0.04)
- Europe > Russia > Central Federal District > Moscow Oblast > Moscow (0.04)
- Information Technology > Services (0.47)
- Information Technology > Software (0.34)
- Information Technology > Data Science (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Ontologies (1.00)
- Information Technology > Artificial Intelligence > Machine Learning (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Information Fusion (0.92)
Sechrist Industries, Inc. Introduces New Cloud System
Sechrist Industries, Inc., the pioneer of monoplace hyperbaric chambers and air/oxygen mixers, has introduced its new, proprietary Sechrist Cloud System. Available exclusively to Sechrist Industries customers, the Hyperbaric Information Tracking System keeps all key information about customers' Sechrist Monoplace Hyperbaric Systems at their fingertips, available online, 24/7. Company President Deepak Talati remarked: "Sechrist believes that coming up with solutions to make the workload for clinicians and technicians easier is important so that more time can be spent caring for patients. The Sechrist Cloud Hyperbaric Information Tracking System puts all key chamber information in one easy to access location eliminating the need for binders and paper. Our goal at Sechrist is to make record keeping paperless and always accessible. The Sechrist Cloud System is designed to provide our customers with more time for patient care and less time managing paper."
Construction Industry Top 10 Trends in the Next Decade
AEM presented 10 top trends for the future of building construction, among them alternative power, the electrification of compact equipment, autonomous machinery, and sensors for increased safety. Referencing recent plans for aviation fuel regulation, the California Air Resources Board's (CARB) ban on small engines on new equipment starting in 2024, the Environmental Protection Agency's (EPA) new greenhouse gas emissions rules for 2023–2026 passenger vehicles and light-duty trucks, and the EPA's plan to reduce greenhouse gas emissions from heavy-duty trucks starting with 2027 models, the AEM whitepaper asserts that construction companies will see their fleets change over the next decade as well. Major corporations continue to invest in renewable energy such as biofuels, solar, and wind power, as construction companies and large contractors commit to net-zero impact pledges for new buildings and infrastructure. The United States' commitment to cutting carbon emissions by 50% by 2030 will spur "the electrification of many segments of the compact construction equipment market" over the next 10 years, according to AEM. Thanks to advanced 5G networks and cloud systems, equipment tracking will allow real-time visibility into productivity and maintenance on a jobsite, so operators and contractors can make sure they queue properly and have the most efficient job flow they can.
- Law > Environmental Law (1.00)
- Construction & Engineering (1.00)
- Government > Regional Government > North America Government > United States Government (0.94)
- Energy > Renewable > Wind (0.57)
Cloud Intelligence/AIOps – Infusing AI into Cloud Computing Systems - Microsoft Research
When legendary computer scientist Jim Gray accepted the Turing Award in 1999, he laid out a dozen long-range information technology research goals. One of those goals called for the creation of trouble-free server systems or, in Gray's words, to "build a system used by millions of people each day and yet administered and managed by a single part-time person." Gray envisioned a self-organizing "server in the sky" that would store massive amounts of data, and refresh or download data as needed. Today, with the emergence and rapid advancement of artificial intelligence (AI), machine learning (ML) and cloud computing, and Microsoft's development of Cloud Intelligence/AIOps, we are closer than we have ever been to realizing that vision, and to moving beyond it. Over the past fifteen years, the most significant paradigm shift in the computing industry has been the migration to cloud computing, which has created unprecedented digital transformation opportunities and benefits for business, society, and human life.
Leading edge computing companies of 2022
Edge computing refers to a solution where data processing, analysis and in some cases, actions, occur close to the place where the data originated. Edge computing often relies on a sporadic connection to cloud computing systems, although some setups similarly connect to nearby devices, in which case the systems might be referred to as part of the Internet of Things (IoT). Edge computing solutions operate in circumstances where current cloud computing systems won't suffice, due to one or more of the following concerns: Wherever you encounter one or more of the above four constraints, you'll also find an example of an edge computing solution. Machines, such as autonomous cars or industrial robots, generate huge quantities of data and act with low latency. Some agricultural systems operate in areas that lack high-bandwidth network connections.
- North America > United States (0.04)
- North America > Aruba (0.04)
- Information Technology > Services (1.00)
- Transportation > Ground > Road (0.49)
- Information Technology > Communications > Networks (1.00)
- Information Technology > Cloud Computing (1.00)
- Information Technology > Artificial Intelligence > Robots > Autonomous Vehicles (0.34)